week 4: overfitting/mcmc

mcmc – brms

annoucement

Office hours will be on Tuesdays (10-12) starting next week.

10 islands, in a loop. Island 2 is twice as big as island 1, island 3 is three times as big as island 1, etc.

## record current position
    current = 5
  ## flip coin to generate proposal
    (coin = sample( c(-1,1) , size=1 ))
[1] -1
    (proposal = current + coin)
[1] 4
  ## move?
    (prob_move <- proposal/current)
[1] 0.8
    (current <- ifelse( runif(1) < prob_move , proposal , current ))
[1] 4

metropolis algorithm

num_weeks <- 1e5
positions <- rep(0,num_weeks)
current <- 10
for ( i in 1:num_weeks ) {
  ## record current position
    positions[i] <- current
  ## flip coin to generate proposal
    proposal <- current + sample( c(-1,1) , size=1 )
  ## now make sure he loops around the archipelago
    if ( proposal < 1 ) proposal <- 10
    if ( proposal > 10 ) proposal <- 1
  ## move?
    prob_move <- proposal/current
    current <- ifelse( runif(1) < prob_move , proposal , current )
}
data.frame(weeks = 1:1e5,positions) %>% 
  filter(weeks <=100) %>% 
  ggplot( aes(x=weeks, y=positions)) +
  geom_line()

gibbs sampling

It works by drawing samples from conditional distributions of each parameter given current values of all other parameters. The process iteratively updates one parameter at a time, eventually converging to the joint posterior distribution. Gibbs sampling is particularly effective for high-dimensional problems where direct sampling is difficult, and it’s computationally efficient because it only requires conditional distributions rather than the full joint distribution.

However, both Metropolis and Gibbs sampling are inefficient as models become more complex. For that reason, we’ll skip right ahead to Hamiltonian Monte Carlo sampling.

hamiltonian

  • this is really a physics simulation – think of a skateboarder in bowl. The skateboarder’s goal is to explore as much of the bowl as possible.
    • When the log-posterior is very flat, because there isn’t much information in the likelihood and the priors are rather flat, then the particle can glide for a long time before the slope (gradient) makes it turn around.
    • When instead the log-posterior is very steep, because either the likelihood or the priors are very concentrated, then the particle doesn’t get far before turning around.
    Iterations are the total number of runs you let the skateboarder take.
    • Every iteration is a complete journey around (or a part of) the bowl.
    • In each iteration, the skateboarder starts at some position, follows physics (using leapfrog steps), and proposes a new point.
    • More iterations = more chances to explore every part of the bowl.
  • HMC rejects some proposals.
    • What is the rejection criterion? Because HMC runs a physics simulation, certain things have to be conserved, like total energy of the system. When the total energy changes during the simulation, that means the numerical approximation is bad. When the approximation isn’t good, it might reject the proposal.

settings

  • LEAPFROG STEPS – Each path in the simulation is divided up into a number of leapfrog steps. If you choose many steps, the paths will be long. If you choose few, they will be short.
    • Each leapfrog step is like a small, controlled push that updates the skateboarder’s position (parameter values) and momentum (direction/speed of movement).
    • The skateboarder can’t just slide forever—they update their motion in little bursts.
    • At each leapfrog step, they briefly calculate where they are, how steep the slope is (gradient of the log posterior), and use that to push off in a slightly new direction
    • Look where the slope is heading → nudge momentum → roll forward → check again.
  • STEP SIZE – The step size is how big each leapfrog step is—how far the skateboarder moves before checking the slope again. A larger step size means fewer steps to get around the bowl, but they risk overshooting or taking turns too sharply. A smaller step size gives more precise control but takes longer to get anywhere—it’s more cautious.
    • Large step size: like pushing off hard and zooming forward—great when the terrain is flat.
    • Small step size: gentle push, perfect for when the bowl has tight curves and sudden drops.

The warmup period of Stan is figuring out what the leapfrog steps and step size should be. This uses an algorithm called a No-U-Turn Sampler (NUTS).

model specification

Let’s return to the height and weight data.

data(Howell1, package = "rethinking")
d <- Howell1
library(measurements)
d$height <- conv_unit(d$height, from = "cm", to = "feet")
d$weight <- conv_unit(d$weight, from = "kg", to = "lbs")
describe(d, fast = T)
       vars   n  mean    sd median  min    max range  skew kurtosis   se
height    1 544  4.54  0.91   4.88 1.77   5.88   4.1 -1.26     0.58 0.04
weight    2 544 78.51 32.45  88.31 9.37 138.87 129.5 -0.54    -0.94 1.39
age       3 544 29.34 20.75  27.00 0.00  88.00  88.0  0.49    -0.56 0.89
male      4 544  0.47  0.50   0.00 0.00   1.00   1.0  0.11    -1.99 0.02
d <- d[d$age >= 18, ]
d$height_c <- d$height - mean(d$height)

\[\begin{align*} w_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta (h_i - \bar{h}) \\ \alpha &\sim \text{Normal}(130, 20) \\ \beta &\sim \text{Normal}(0, 25) \\ \sigma &\sim \text{Uniform}(0, 25) \\ \end{align*}\]

m1 <-brm(
  data = d, 
  family = gaussian,
  weight ~ 1 + height_c,
  prior = c( prior( normal(130,20), class = Intercept),
             prior( normal(0,25), class = b),
             prior( uniform(0,50), class = sigma, ub = 50)
    ), 
  iter = 5000, warmup = 1000, chains = 4, 
  seed = 3, 
      file = here("files/data/generated_data/m1"))

brm() is the core function for fitting Bayesian models using brms. This function translates your model above into a Stan model, which in turn defines the sampler and does the hard work. This is running in the background of your computer. There are many programs that can be used to run Stan, and you can even run Stan directly if you want.

brms::stancode(m1)
// generated with brms 2.22.0
functions {
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  int<lower=1> Kc;  // number of population-level effects after centering
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b;  // regression coefficients
  real Intercept;  // temporary intercept for centered predictors
  real<lower=0,upper=50> sigma;  // dispersion parameter
}
transformed parameters {
  real lprior = 0;  // prior contributions to the log posterior
  lprior += normal_lpdf(b | 0, 25);
  lprior += normal_lpdf(Intercept | 130, 20);
  lprior += uniform_lpdf(sigma | 0, 50)
    - 1 * log_diff_exp(uniform_lcdf(50 | 0, 50), uniform_lcdf(0 | 0, 50));
}
model {
  // likelihood including constants
  if (!prior_only) {
    target += normal_id_glm_lpdf(Y | Xc, Intercept, b, sigma);
  }
  // priors including constants
  target += lprior;
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
}
m1 <-brm(
  data = d, 
  family = gaussian,
  weight ~ 1 + height_c,
  prior = c( prior( normal(130,20), class = Intercept),
             prior( normal(0,25), class = b),
             prior( uniform(0,50), class = sigma, ub = 50)
    ), 
  iter = 5000, warmup = 1000, chains = 4,
  seed = 3, 
      file = here("files/data/generated_data/m42.1"))

family specifies the distribution of the outcome family. In many examples, we’ll use a gaussian (normal) distribution. But there are many many many options for this.

m1 <-brm(
  data = d, 
  family = gaussian,
  weight ~ 1 + height_c,
  prior = c( prior( normal(130,20), class = Intercept),
             prior( normal(0,25), class = b),
             prior( uniform(0,50), class = sigma, ub = 50)
    ), 
  iter = 5000, warmup = 1000, chains = 4,
  seed = 3, 
      file = here("files/data/generated_data/m1"))

The formula argument is what you would expect from the lm() and lmer() functions you have seen in the past. The benefit of brms is that this formula can easily handle complex and non-linear terms. We’ll be playing with more in future classes.

m1 <-brm(
  data = d, 
  family = gaussian,
  weight ~ 1 + height_c,
  prior = c( prior( normal(130,20), class = Intercept),
             prior( normal(0,25), class = b),
             prior( uniform(0,50), class = sigma, ub = 50)
    ), 
  iter = 5000, warmup = 1000, chains = 4,
  seed = 3, 
      file = here("files/data/generated_data/m1"))

Here we set our priors. Class b refers to slope parameters or beta coefficients. Again, this argument has the ability to become very detailed, specific, and flexible, and we’ll play more with this.

m1 <-brm(
  data = d, 
  family = gaussian,
  weight ~ 1 + height_c,
  prior = c( prior( normal(130,20), class = Intercept),
             prior( normal(0,25), class = b),
             prior( uniform(0,50), class = sigma, ub = 50)
    ), 
  iter = 5000, warmup = 1000, chains = 4,
  seed = 3, 
      file = here("files/data/generated_data/m1"))

Hamiltonian MCMC runs for a set number of iterations, throws away the first bit (the warmup), and does that up multiple times (the number of chains).

warmup defaults to iter/2, but you can set it to something else. Remember, the warmup is not simply a burn-in period. It’s used to figure out the appropriate leapfrog and step size settings. So it’s worth allowing this to be large-ish. Also note that the warmup period will generally run more slowly than the sampling period.

In general, it’s recommended that you use one short chain to debug, four longer chains for estimation, verification, and inference.

m1 <-brm(
  data = d, 
  family = gaussian,
  weight ~ 1 + height_c,
  prior = c( prior( normal(130,20), class = Intercept),
             prior( normal(0,25), class = b),
             prior( uniform(0,50), class = sigma, ub = 50)
    ), 
  iter = 5000, warmup = 1000, chains = 4,
  seed = 3, 
      file = here("files/data/generated_data/m1"))

Remember, these are random walks through parameter space, so set a seed for reproducbility. Also, these can take a while to run, especially when you are developing more complex models. If you specify a file, the output of the model will automatically be saved. Even better, then next time you run this code, R will check for that file and load it into your workspace instead of re-running the model. (Just be sure to delete the model file if you make changes to any other part of the code.)

one more option: cores

m1 <-brm(
  data = d, 
  family = gaussian,
  weight ~ 1 + height_c,
  prior = c( prior( normal(130,20), class = Intercept),
             prior( normal(0,25), class = b),
             prior( uniform(0,50), class = sigma, ub = 50)
    ), 
  iter = 5000, warmup = 1000, chains = 4,
  cores = 4,
  seed = 3, 
      file = here("files/data/generated_data/m1"))

You can parallelized your chains (run them in parallel instead of in sequence) by defining how many cores you want to run. Your computer (probably) has at least 4 cores, so 4 is a good choice here. But be warned that your computer may slow down in other places (while the code is running), and may even run a bit hot.

summary(m1)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: weight ~ 1 + height_c 
   Data: d (Number of observations: 352) 
  Draws: 4 chains, each with iter = 5000; warmup = 1000; thin = 1;
         total post-warmup draws = 16000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    99.21      0.50    98.23   100.18 1.00    14807    12164
height_c     42.05      1.95    38.22    45.91 1.00    15921    12552

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     9.38      0.36     8.72    10.12 1.00    15626    12366

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

diagnostics: mixed chains

To evaluate your model, the first question to ask is “did your chains mix?”

library(bayesplot)
mcmc_trace(m1)
mcmc_rank_overlay(m1, pars=vars(b_Intercept:sigma)) +ylim(150, NA)

diagnositics: effective sample size

This is not \(N\) like the number of participants in your sample. This is the number of draws from the posterior. The effective number is an estimate of the number of independent samples from the posterior (remember probability). Markov chains are typically autocorrelated, so that sequential samples are not entirely independent. This happens when chains explore the posterior slowly, like in a Metropolis algorithm. Autocorrelation reduces the effective number of samples. Think of effective sample size like the length of a Markov chain with no autocorrelation that is the same quality as your estimate.

brms gives you two estimates for effective sample size: bulk and tail.

How many do you need? It depends on how much skewed your posterior is.

summary(m1)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: weight ~ 1 + height_c 
   Data: d (Number of observations: 352) 
  Draws: 4 chains, each with iter = 5000; warmup = 1000; thin = 1;
         total post-warmup draws = 16000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    99.21      0.50    98.23   100.18 1.00    14807    12164
height_c     42.05      1.95    38.22    45.91 1.00    15921    12552

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     9.38      0.36     8.72    10.12 1.00    15626    12366

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

diagnostics: \(\hat{R}\)

The Gelman-Rubin convergence metric \((\hat{R})\) is a comparison of between-chain variance to within-chain variance. You want this number to be as close to 1 as you can.

Note that a value of 1 doesn’t mean that you’re fine. It’s a sign of danger, but not a sign of safety.

summary(m1)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: weight ~ 1 + height_c 
   Data: d (Number of observations: 352) 
  Draws: 4 chains, each with iter = 5000; warmup = 1000; thin = 1;
         total post-warmup draws = 16000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    99.21      0.50    98.23   100.18 1.00    14807    12164
height_c     42.05      1.95    38.22    45.91 1.00    15921    12552

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     9.38      0.36     8.72    10.12 1.00    15626    12366

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

divergent transitions

A rule in physics is that energy must be conserved. Remember that HMC is literally a physics simulation. Energy won’t be conserved in the system if your skateboarder zings out of the bowl and into outer space. That’s a divergent transition.

Two primary ways to handle divergent transitions are by increasing the adapt_delta parameter and reparameterization.

adapt_delta specifies the target average acceptance probability during the adaptation phase of sampling (warmup). The higher the delta, the smaller the step size. This makes the model more conservative, slower, and also much less likely to encounter problems. Its default is .8. Try increasing it to .9 or .99 if you have divergent transitions.

adapt_delta is like adjusting how smooth and controlled the skateboarder’s ride should be. * A higher adapt_delta (e.g., 0.99) means the skateboarder is super cautious, preferring smooth, stable rides (i.e., high acceptance rates). This leads to smaller step sizes and more leapfrog steps per iteration. * A lower adapt_delta (e.g., 0.8) lets the skateboarder be riskier, zooming around the bowl faster, but at a higher risk of crashing (divergent transitions).